This report analyzes the chemical composition of white wines in detail and examines how it affects their quality.
## [1] 4898 13
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
## alcohol quality
## Min. : 8.00 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.40 Median :6.000
## Mean :10.51 Mean :5.878
## 3rd Qu.:11.40 3rd Qu.:6.000
## Max. :14.20 Max. :9.000
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
The data set contains a total of 4898 wine samples.
##
## 3 4 5 6 7 8 9
## 20 163 1457 2198 880 175 5
Most wines in this data set have a quality score of 5, 6 or 7. No wine in the data set scores below 3, and none reaches 10.
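As a small sketch of how this tabulation can be checked, the quality column can be rebuilt from the counts printed above. The vector names `counts`, `quality` and `share_mid` are introduced here for illustration only; the original code chunk is not shown in the report.

```r
# Counts per quality score, copied from the table above
counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
            "7" = 880, "8" = 175, "9" = 5)

# Rebuild the quality column (row order does not matter for a tabulation)
quality <- rep(as.integer(names(counts)), times = counts)

# Same kind of tabulation as in the report
print(table(quality))

# Share of wines scoring 5, 6 or 7
share_mid <- mean(quality %in% 5:7)
print(round(share_mid, 3))
```

The share comes out at a little under 93 percent, which is what "most wines" refers to.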
Fixed acidity, volatile acidity, citric acid, chlorides, pH, sulphates and density have approximately normal distributions.
Residual sugar has a bimodal distribution: most wines cluster around two sugar values.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 23.00 34.00 35.31 46.00 289.00
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 108.0 134.0 138.4 167.0 440.0
Free sulfur dioxide and total sulfur dioxide both have outliers. For free sulfur dioxide the median is 34 and the third quartile is 46, but the maximum is 289. Similarly, for total sulfur dioxide the median is 134 and the third quartile is 167, but the maximum is 440.
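One way to make the outlier claim concrete is the common 1.5 x IQR rule. That rule is an assumption of this sketch, not something the report states it used; the quartiles and maxima are copied from the two summaries above.

```r
# Quartiles and maxima copied from the two summaries above
so2 <- data.frame(
  feature = c("free.sulfur.dioxide", "total.sulfur.dioxide"),
  q1  = c(23, 108),
  q3  = c(46, 167),
  max = c(289, 440)
)

# Upper fence under the 1.5 * IQR convention (assumed here)
so2$iqr <- so2$q3 - so2$q1
so2$upper_fence <- so2$q3 + 1.5 * so2$iqr

# Does the maximum lie beyond the fence?
so2$max_is_outlier <- so2$max > so2$upper_fence
print(so2)
```

Under this convention both maxima lie far beyond the upper fence, which supports treating them as outliers.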
Alcohol seems to have a roughly uniform distribution.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 1225 2450 2450 3674 4898
The variable X runs from 1 to 4898 with equal mean and median, so it appears to be a row index rather than a chemical property; its log-transformed histogram is left (negatively) skewed.
The data set has 4898 wines with 13 features (X, fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol, quality). The following observations were made about the data:
* Most wines in this data set have a quality score of 5, 6 or 7.
* No wine in the data set scores below 3 or reaches 10.
* Alcohol is roughly uniformly distributed.
* X is negatively skewed after the log transform.
* Fixed acidity, volatile acidity, citric acid, chlorides, pH, sulphates and density are approximately normally distributed.
* Residual sugar is bimodal on a log scale, and free and total sulfur dioxide have outliers.
Quality is the main feature of interest, and I want to find out how the other features influence it. I strongly suspect that residual sugar is related to wine quality, and it seems logical that alcohol content is too. From this univariate analysis alone, however, it is very difficult to establish any relationship between quality and the other features.
With the default settings the histogram of X had an unusual shape, so I applied a log transform, which gives a left-skewed distribution. On the original scale its values are centered around 2500. On a log scale residual sugar changes from a right-skewed to a bimodal distribution, with most wines falling around 3 or 9. The bin width was adjusted in all the histograms.
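The effect of the log transform on X can be sketched without the data file, assuming X is simply the row index 1 to 4898 (its summary above has minimum 1 and maximum 4898). The `skewness` helper is a plain moment-based definition written for this sketch, not a function from the original analysis.

```r
# Assumption: X is the row index of the data set
x <- 1:4898

# Moment-based sample skewness (one common definition)
skewness <- function(v) {
  d <- v - mean(v)
  mean(d^3) / mean(d^2)^1.5
}

skew_raw <- skewness(x)         # symmetric index, skewness of zero
skew_log <- skewness(log10(x))  # negative value means a left skew
print(c(raw = skew_raw, log10 = skew_log))

# Bin counts of the log-scale histogram; drop plot = FALSE to draw it
h <- hist(log10(x), breaks = 30, plot = FALSE)
print(h$counts)
```

The bin counts grow toward the right edge, so the long tail of the log-scale histogram points to the left.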
## X fixed.acidity volatile.acidity citric.acid residual.sugar
## 733 733 8.8 0.280 0.45 6.0
## 2252 2252 7.4 0.180 0.29 1.4
## 3169 3169 6.2 0.190 0.38 5.1
## 1543 1543 7.2 0.160 0.49 1.3
## 499 499 5.7 0.335 0.34 1.0
## 376 376 5.1 0.330 0.22 1.6
## chlorides free.sulfur.dioxide total.sulfur.dioxide density pH
## 733 0.022 14 49 0.99340 3.01
## 2252 0.042 34 101 0.99384 3.54
## 3169 0.019 22 82 0.98961 3.05
## 1543 0.037 27 104 0.99240 3.23
## 499 0.040 13 174 0.99200 3.27
## 376 0.027 18 89 0.98930 3.51
## sulphates alcohol quality
## 733 0.33 11.1 7
## 2252 0.60 10.5 7
## 3169 0.36 12.5 6
## 1543 0.57 10.6 6
## 499 0.66 10.0 5
## 376 0.38 12.5 7
##
## Two-Step Estimates
##
## Correlations/Type of Correlation:
## X fixed.acidity volatile.acidity citric.acid
## X 1 Pearson Pearson Pearson
## fixed.acidity -0.2558 1 Pearson Pearson
## volatile.acidity 0.002858 -0.0227 1 Pearson
## citric.acid -0.1499 0.2892 -0.1495 1
## residual.sugar 0.006624 0.08902 0.06429 0.09421
## chlorides -0.04565 0.02309 0.07051 0.1144
## free.sulfur.dioxide -0.01193 -0.0494 -0.09701 0.09408
## total.sulfur.dioxide -0.162 0.09107 0.08926 0.1211
## density -0.186 0.2653 0.02711 0.1495
## pH -0.1158 -0.4259 -0.03192 -0.1637
## sulphates 0.009808 -0.01714 -0.03573 0.06233
## alcohol 0.2137 -0.1209 0.06772 -0.07573
## quality 0.03576 -0.1137 -0.1947 -0.009209
## residual.sugar chlorides free.sulfur.dioxide
## X Pearson Pearson Pearson
## fixed.acidity Pearson Pearson Pearson
## volatile.acidity Pearson Pearson Pearson
## citric.acid Pearson Pearson Pearson
## residual.sugar 1 Pearson Pearson
## chlorides 0.08868 1 Pearson
## free.sulfur.dioxide 0.2991 0.1014 1
## total.sulfur.dioxide 0.4014 0.1989 0.6155
## density 0.839 0.2572 0.2942
## pH -0.1941 -0.09044 -0.0006178
## sulphates -0.02666 0.01676 0.05922
## alcohol -0.4506 -0.3602 -0.2501
## quality -0.09758 -0.2099 0.008158
## total.sulfur.dioxide density pH sulphates
## X Pearson Pearson Pearson Pearson
## fixed.acidity Pearson Pearson Pearson Pearson
## volatile.acidity Pearson Pearson Pearson Pearson
## citric.acid Pearson Pearson Pearson Pearson
## residual.sugar Pearson Pearson Pearson Pearson
## chlorides Pearson Pearson Pearson Pearson
## free.sulfur.dioxide Pearson Pearson Pearson Pearson
## total.sulfur.dioxide 1 Pearson Pearson Pearson
## density 0.5299 1 Pearson Pearson
## pH 0.002321 -0.09359 1 Pearson
## sulphates 0.1346 0.07449 0.156 1
## alcohol -0.4489 -0.7801 0.1214 -0.01743
## quality -0.1747 -0.3071 0.09943 0.05368
## alcohol quality
## X Pearson Pearson
## fixed.acidity Pearson Pearson
## volatile.acidity Pearson Pearson
## citric.acid Pearson Pearson
## residual.sugar Pearson Pearson
## chlorides Pearson Pearson
## free.sulfur.dioxide Pearson Pearson
## total.sulfur.dioxide Pearson Pearson
## density Pearson Pearson
## pH Pearson Pearson
## sulphates Pearson Pearson
## alcohol 1 Pearson
## quality 0.4356 1
##
## Standard Errors:
## X fixed.acidity volatile.acidity citric.acid
## X
## fixed.acidity 0.01336
## volatile.acidity 0.01429 0.01428
## citric.acid 0.01397 0.0131 0.01397
## residual.sugar 0.01429 0.01418 0.01423 0.01416
## chlorides 0.01426 0.01428 0.01422 0.0141
## free.sulfur.dioxide 0.01429 0.01426 0.01416 0.01416
## total.sulfur.dioxide 0.01392 0.01417 0.01418 0.01408
## density 0.0138 0.01328 0.01428 0.01397
## pH 0.0141 0.0117 0.01428 0.01391
## sulphates 0.01429 0.01429 0.01427 0.01423
## alcohol 0.01364 0.01408 0.01422 0.01421
## quality 0.01427 0.01411 0.01375 0.01429
## residual.sugar chlorides free.sulfur.dioxide
## X
## fixed.acidity
## volatile.acidity
## citric.acid
## residual.sugar
## chlorides 0.01418
## free.sulfur.dioxide 0.01301 0.01414
## total.sulfur.dioxide 0.01199 0.01372 0.008878
## density 0.004233 0.01334 0.01305
## pH 0.01375 0.01417 0.01429
## sulphates 0.01428 0.01429 0.01424
## alcohol 0.01139 0.01244 0.0134
## quality 0.01415 0.01366 0.01429
## total.sulfur.dioxide density pH sulphates
## X
## fixed.acidity
## volatile.acidity
## citric.acid
## residual.sugar
## chlorides
## free.sulfur.dioxide
## total.sulfur.dioxide
## density 0.01028
## pH 0.01429 0.01416
## sulphates 0.01403 0.01421 0.01394
## alcohol 0.01141 0.005594 0.01408 0.01429
## quality 0.01385 0.01294 0.01415 0.01425
## alcohol
## X
## fixed.acidity
## volatile.acidity
## citric.acid
## residual.sugar
## chlorides
## free.sulfur.dioxide
## total.sulfur.dioxide
## density
## pH
## sulphates
## alcohol
## quality 0.01158
##
## n = 4898
##
## P-values for Tests of Bivariate Normality:
## X fixed.acidity volatile.acidity citric.acid
## X
## fixed.acidity 1.384e-135
## volatile.acidity 4.43e-79 8.326e-51
## citric.acid 8.099e-177 7.094e-126 3.11e-162
## residual.sugar 1.269e-153 3.961e-142 3.871e-146 6.704e-208
## chlorides 0 0 0 0
## free.sulfur.dioxide 2.436e-59 9.489e-44 2.307e-50 1.481e-110
## total.sulfur.dioxide 4.165e-65 1.731e-38 3.649e-49 2.145e-108
## density 6.906e-101 2.053e-49 1.458e-45 1.894e-132
## pH 2.823e-57 5.114e-36 2.379e-36 3.439e-101
## sulphates 1.308e-56 1.076e-33 4.068e-33 4.195e-103
## alcohol 3.053e-105 1.172e-74 1.458e-96 6.265e-186
## quality 0 0 0 0
## residual.sugar chlorides free.sulfur.dioxide
## X
## fixed.acidity
## volatile.acidity
## citric.acid
## residual.sugar
## chlorides 0
## free.sulfur.dioxide 2.279e-119 0
## total.sulfur.dioxide 9.659e-122 0 2.231e-30
## density 3.89e-196 0 1.384e-52
## pH 1.085e-119 0 3.012e-24
## sulphates 2.257e-116 0 1.06e-18
## alcohol 3.624e-202 0 9.643e-71
## quality 0 0 0
## total.sulfur.dioxide density pH sulphates
## X
## fixed.acidity
## volatile.acidity
## citric.acid
## residual.sugar
## chlorides
## free.sulfur.dioxide
## total.sulfur.dioxide
## density 1.193e-28
## pH 3.591e-17 1.448e-34
## sulphates 6.053e-32 1.796e-35 1.473e-17
## alcohol 2.343e-57 3.223e-108 2.598e-62 3.961e-84
## quality 0 0 0 0
## alcohol
## X
## fixed.acidity
## volatile.acidity
## citric.acid
## residual.sugar
## chlorides
## free.sulfur.dioxide
## total.sulfur.dioxide
## density
## pH
## sulphates
## alcohol
## quality 0
The Pearson correlations between quality and the other features are quite low. Alcohol has the highest correlation with quality, at 0.4356. Other features that might influence quality are fixed acidity, volatile acidity, residual sugar, chlorides, total sulfur dioxide, density and sulphates.
##
## Pearson's product-moment correlation
##
## data: wine$alcohol and wine$quality
## t = 33.858, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4126015 0.4579941
## sample estimates:
## cor
## 0.4355747
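The numbers in this test output follow from the correlation and the sample size alone. The sketch below recomputes the t statistic and the 95 percent confidence interval (via the Fisher z transform) from `r` and `n` as printed above; the variable names are chosen here for illustration.

```r
# Values copied from the output above
r <- 0.4355747
n <- 4898

# t statistic for H0: true correlation is zero
df <- n - 2
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# 95 percent confidence interval via the Fisher z transform
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

# Two-sided p-value
p_value <- 2 * pt(-abs(t_stat), df)

print(round(t_stat, 3))
print(round(ci, 4))
print(p_value < 2.2e-16)
```

With a sample this large even a moderate correlation gives a very narrow interval and a vanishing p-value, so the size of the correlation matters more here than its significance.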
The scatter plot of alcohol against quality shows that wines of quality 5, 6 and 7 span alcohol contents from 8 percent to 13 percent. Lower quality wines, scoring 5 and below, mostly have alcohol content between 8 and 11 percent.
##
## Pearson's product-moment correlation
##
## data: wine$alcohol and wine$density
## t = -87.255, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.7908646 -0.7689315
## sample estimates:
## cor
## -0.7801376
Density and alcohol are negatively correlated, with a Pearson R of -0.7801: wines with higher alcohol content have lower density.
##
## Pearson's product-moment correlation
##
## data: wine$residual.sugar and wine$density
## t = 107.87, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8304732 0.8470698
## sample estimates:
## cor
## 0.8389665
Residual sugar and density are positively correlated, with a correlation of 0.839.
As expected, the relationship between density and quality is the opposite of that between alcohol and quality. For instance, lower quality wines predominantly have higher density.
It is difficult to establish any relationship between fixed acidity and quality. Volatile acidity has a lot of outliers, and even after removing them I cannot establish a clear relationship between quality and either acidity measure. All quality scores have similar distributions of volatile acidity and fixed acidity.
Residual sugar has a lot of outliers, so only values up to the 99th percentile are analyzed. The scatter plot suggests that wines of quality 4 or below have predominantly lower residual sugar, while wines of quality 5 and above have similar residual sugar distributions to one another. This is a surprise: I was expecting a pattern similar to that of density and quality, but it was closer to that of alcohol and quality.
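The percentile cut can be sketched as follows. The helper `trim_top` is hypothetical, and the vector holds only a handful of residual sugar values that appear in the printed output above (including the maximum of 65.8), not the full column.

```r
# A few residual sugar values taken from the printed output; not the full data
sugar <- c(20.7, 1.6, 6.9, 8.5, 1.5, 6.0, 1.4, 5.1, 1.3, 1.0, 65.8)

# Hypothetical helper: keep values at or below the p-th quantile
trim_top <- function(v, p = 0.99) {
  v[v <= quantile(v, p)]
}

sugar_trimmed <- trim_top(sugar)
print(sugar_trimmed)
```

On the real data the same idea would be applied to the `residual.sugar` column before plotting, for example by filtering the rows of the data frame or by limiting the plot axis.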
Chlorides have some outliers, so only values up to the 99th percentile are considered. Middle quality wines cover a wide range of chloride values, whereas lower and higher quality wines have lower chloride values. This may be because the middle quality scores have more samples and hence quite a high variance.
No relationship could be established between quality and total sulfur dioxide.
Wines of different quality have similar distributions of sulphates.
##
## 3 4 5 6 7 8 9
## 20 163 1457 2198 880 175 5
For the box plot analysis, quality is converted to a factor variable.
Wines of quality 6 and above have a higher median alcohol value, and the median increases with quality from 6 upward. As expected, density shows the exact opposite trend.
I could not find any definite pattern in the box plots of quality against residual sugar, total sulfur dioxide or sulphates.
Lower quality wines have a higher median chloride content than higher quality wines.
Quality is related to alcohol, with a correlation coefficient of 0.4356: the higher the quality, the higher the alcohol content. Wines of quality 7, 8 and 9 have a higher median alcohol content than lower quality wines.
The next feature that influences wine quality is density, with a correlation of -0.3071. Density is a physical property affected by the other chemical components of the wine, in our case alcohol and residual sugar. I strongly suspect that density itself does not affect wine quality much, since it is determined by those other components.
Chlorides are negatively correlated with wine quality, with a correlation coefficient of -0.2099. Wines of quality 7, 8 and 9 have a lower median chloride content than lower quality wines.
The other features that seem to affect wine quality are fixed acidity, volatile acidity, residual sugar, total sulfur dioxide and sulphates. Further analysis is needed to determine the relationship between these features and wine quality.
Density correlates strongly with residual sugar, with a correlation coefficient of 0.839.
There is a strong negative correlation between alcohol and density: the higher the percentage of alcohol, the lower the density.
I found a moderate positive correlation between alcohol and quality (0.4356). Density has a weaker negative correlation with quality (-0.3071), and chlorides are another feature with a negative correlation (-0.2099).
##
## high low medium
## 1060 1640 2198
The wines are divided into three categories: low, medium and high. A quality below 6 is categorized as low, a quality of 6 as medium and a quality above 6 as high.
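A minimal sketch of this labelling rule, using the quality counts printed earlier to rebuild the quality column. The name `qualityLabel` matches the column that appears in the output of this report; the exact code the report used is not shown, so the nested `ifelse` is an assumption.

```r
# Quality counts copied from the tabulation earlier in the report
counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
            "7" = 880, "8" = 175, "9" = 5)
quality <- rep(as.integer(names(counts)), times = counts)

# Below 6 is low, exactly 6 is medium, above 6 is high
qualityLabel <- ifelse(quality < 6, "low",
                       ifelse(quality == 6, "medium", "high"))

label_counts <- table(qualityLabel)
print(label_counts)
```

Because the labels are plain character strings, `table()` lists them alphabetically (high, low, medium), which matches the order in the output above.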
Low quality wines mostly have an alcohol value of 11 or lower and volatile acidity in the range 0.2 to 0.6. Medium quality wines predominantly have alcohol content of 11 or lower and volatile acidity in the range 0.1 to 0.5. High quality wines predominantly have alcohol content of 11 or higher and volatile acidity in the range 0.1 to 0.5. This behavior is as expected, given the positive correlation between alcohol and wine quality and the negative correlation between volatile acidity and wine quality.
Low and medium quality wines mostly have fixed acidity from 5 to 8.5 and alcohol content below 11, whereas high quality wines mostly have fixed acidity from 5 to 7.5 and alcohol content above 11. This is consistent with our correlation analysis.
Low and medium quality wines mostly have residual sugar from 2 to 20 and alcohol content below 11, whereas high quality wines mostly have residual sugar from 2 to 13 and alcohol content above 11. This is consistent with our correlation analysis.
## X fixed.acidity volatile.acidity citric.acid
## Min. : 11 Min. : 4.200 Min. :0.1000 Min. :0.0000
## 1st Qu.:1135 1st Qu.: 6.400 1st Qu.:0.2400 1st Qu.:0.2400
## Median :2238 Median : 6.800 Median :0.2900 Median :0.3200
## Mean :2318 Mean : 6.962 Mean :0.3103 Mean :0.3343
## 3rd Qu.:3533 3rd Qu.: 7.500 3rd Qu.:0.3500 3rd Qu.:0.4100
## Max. :4895 Max. :11.800 Max. :1.1000 Max. :1.0000
##
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.04000 1st Qu.: 20.00
## Median : 6.625 Median :0.04700 Median : 34.00
## Mean : 7.054 Mean :0.05144 Mean : 35.34
## 3rd Qu.:11.025 3rd Qu.:0.05300 3rd Qu.: 49.00
## Max. :23.500 Max. :0.34600 Max. :289.00
##
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9872 Min. :2.79 Min. :0.2500
## 1st Qu.:117.0 1st Qu.:0.9932 1st Qu.:3.08 1st Qu.:0.4100
## Median :149.0 Median :0.9951 Median :3.16 Median :0.4700
## Mean :148.6 Mean :0.9952 Mean :3.17 Mean :0.4815
## 3rd Qu.:182.0 3rd Qu.:0.9971 3rd Qu.:3.24 3rd Qu.:0.5300
## Max. :440.0 Max. :1.0024 Max. :3.79 Max. :0.8800
##
## alcohol quality qualityfactor qualityLabel
## Min. : 8.00 Min. :3.000 3: 20 Length:1640
## 1st Qu.: 9.20 1st Qu.:5.000 4: 163 Class :character
## Median : 9.60 Median :5.000 5:1457 Mode :character
## Mean : 9.85 Mean :4.876 6: 0
## 3rd Qu.:10.40 3rd Qu.:5.000 7: 0
## Max. :13.60 Max. :5.000 8: 0
## 9: 0
Low and medium quality wines mostly have free sulfur dioxide from 10 to 60 and alcohol content below 11, whereas high quality wines mostly have free sulfur dioxide from 25 to 50 and alcohol content above 11. This explains the weak correlation between wine quality and free sulfur dioxide.
Low and medium quality wines mostly have sulphates from 0.3 to 0.6 and alcohol content below 11, whereas high quality wines mostly have sulphates from 0.25 to 0.7 and alcohol content above 11. This is consistent with our correlation analysis.
Chlorides and free sulfur dioxide, at different alcohol contents, do not seem to have any effect on quality.
## alcohol volatile.acidity density
## 5.142229 1.041865 16.008081
## fixed.acidity residual.sugar free.sulfur.dioxide
## 1.406153 7.233439 1.147541
## sulphates
## 1.125417
## alcohol volatile.acidity fixed.acidity
## 1.303170 1.029825 1.026391
## residual.sugar free.sulfur.dioxide sulphates
## 1.346054 1.147219 1.006988
Since density is affected by alcohol and residual sugar, the variance inflation factor (VIF) of all our variables has to be checked before the linear regression analysis. Density has a high VIF (about 16), and after removing it the VIFs of the other variables are acceptable (all below 1.4).
Thus we will use alcohol, volatile acidity, fixed acidity, residual sugar, free sulfur dioxide and sulphates to create a linear model of quality.
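For numeric predictors, the VIF of each variable equals the corresponding diagonal entry of the inverse of the predictors' correlation matrix. The sketch below applies that identity to the correlations printed earlier in this report, so it should land close to the VIF values above, though not exactly, because the correlations are rounded. The report's own VIF code is not shown, and `vif_from_cor` is a helper written only for this illustration.

```r
vars <- c("alcohol", "volatile.acidity", "density", "fixed.acidity",
          "residual.sugar", "free.sulfur.dioxide", "sulphates")
R <- diag(length(vars))
dimnames(R) <- list(vars, vars)

# Pairwise Pearson correlations copied from the correlation output above
pairs <- list(
  list("alcohol", "volatile.acidity", 0.06772),
  list("alcohol", "density", -0.7801),
  list("alcohol", "fixed.acidity", -0.1209),
  list("alcohol", "residual.sugar", -0.4506),
  list("alcohol", "free.sulfur.dioxide", -0.2501),
  list("alcohol", "sulphates", -0.01743),
  list("volatile.acidity", "density", 0.02711),
  list("volatile.acidity", "fixed.acidity", -0.0227),
  list("volatile.acidity", "residual.sugar", 0.06429),
  list("volatile.acidity", "free.sulfur.dioxide", -0.09701),
  list("volatile.acidity", "sulphates", -0.03573),
  list("density", "fixed.acidity", 0.2653),
  list("density", "residual.sugar", 0.839),
  list("density", "free.sulfur.dioxide", 0.2942),
  list("density", "sulphates", 0.07449),
  list("fixed.acidity", "residual.sugar", 0.08902),
  list("fixed.acidity", "free.sulfur.dioxide", -0.0494),
  list("fixed.acidity", "sulphates", -0.01714),
  list("residual.sugar", "free.sulfur.dioxide", 0.2991),
  list("residual.sugar", "sulphates", -0.02666),
  list("free.sulfur.dioxide", "sulphates", 0.05922)
)
for (p in pairs) {
  R[p[[1]], p[[2]]] <- p[[3]]
  R[p[[2]], p[[1]]] <- p[[3]]
}

# VIF of each predictor = diagonal of the inverse correlation matrix
vif_from_cor <- function(m) setNames(diag(solve(m)), rownames(m))

vif_all <- vif_from_cor(R)
keep <- setdiff(vars, "density")
vif_no_density <- vif_from_cor(R[keep, keep])

print(round(vif_all, 2))
print(round(vif_no_density, 2))
```

Dropping density removes the near-redundancy with alcohol and residual sugar, which is why the remaining VIFs fall close to 1.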
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality" "qualityfactor" "qualityLabel"
##
## Calls:
## m1: lm(formula = quality ~ alcohol, data = wine)
## m2: lm(formula = quality ~ alcohol + volatile.acidity, data = wine)
## m3: lm(formula = quality ~ alcohol + volatile.acidity + fixed.acidity,
## data = wine)
## m4: lm(formula = quality ~ alcohol + volatile.acidity + fixed.acidity +
## residual.sugar, data = wine)
## m5: lm(formula = quality ~ alcohol + volatile.acidity + fixed.acidity +
## residual.sugar + free.sulfur.dioxide, data = wine)
## m6: lm(formula = quality ~ alcohol + volatile.acidity + fixed.acidity +
## residual.sugar + free.sulfur.dioxide + pH, data = wine)
## m7: lm(formula = quality ~ alcohol + volatile.acidity + fixed.acidity +
## residual.sugar + free.sulfur.dioxide + sulphates, data = wine)
##
## ====================================================================================================
## m1 m2 m3 m4 m5 m6 m7
## ----------------------------------------------------------------------------------------------------
## (Intercept) 2.582*** 3.017*** 3.548*** 2.919*** 2.663*** 1.901*** 2.446***
## (0.098) (0.098) (0.141) (0.150) (0.157) (0.338) (0.164)
## alcohol 0.313*** 0.324*** 0.319*** 0.370*** 0.377*** 0.377*** 0.378***
## (0.009) (0.009) (0.009) (0.010) (0.010) (0.010) (0.010)
## volatile.acidity -1.979*** -1.988*** -2.119*** -2.052*** -2.043*** -2.040***
## (0.110) (0.109) (0.109) (0.109) (0.109) (0.109)
## fixed.acidity -0.068*** -0.074*** -0.068*** -0.052*** -0.067***
## (0.013) (0.013) (0.013) (0.014) (0.013)
## residual.sugar 0.027*** 0.024*** 0.025*** 0.025***
## (0.002) (0.002) (0.003) (0.002)
## free.sulfur.dioxide 0.004*** 0.004*** 0.004***
## (0.001) (0.001) (0.001)
## pH 0.205*
## (0.081)
## sulphates 0.412***
## (0.095)
## ----------------------------------------------------------------------------------------------------
## R-squared 0.2 0.2 0.2 0.3 0.3 0.3 0.3
## adj. R-squared 0.2 0.2 0.2 0.3 0.3 0.3 0.3
## sigma 0.8 0.8 0.8 0.8 0.8 0.8 0.8
## F 1146.4 773.9 527.7 437.6 358.3 300.0 302.8
## p 0.0 0.0 0.0 0.0 0.0 0.0 0.0
## Log-likelihood -5839.4 -5681.8 -5668.2 -5605.7 -5590.5 -5587.2 -5581.1
## Deviance 3112.3 2918.3 2902.2 2829.0 2811.5 2807.8 2800.7
## AIC 11684.8 11371.6 11346.4 11223.4 11195.0 11190.5 11178.2
## BIC 11704.3 11397.5 11378.9 11262.4 11240.5 11242.5 11230.2
## N 4898 4898 4898 4898 4898 4898 4898
## ====================================================================================================
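To show how the fitted coefficients are read, the sketch below applies the rounded m7 coefficients from the table to wine 733 from the sample rows printed earlier. Because the coefficients are rounded to three decimals, this only approximates what `predict()` on the fitted model would return; the object names are chosen here for illustration.

```r
# m7 coefficients as printed in the table above (rounded to 3 decimals)
coef_m7 <- c("(Intercept)" = 2.446, alcohol = 0.378,
             volatile.acidity = -2.040, fixed.acidity = -0.067,
             residual.sugar = 0.025, free.sulfur.dioxide = 0.004,
             sulphates = 0.412)

# Wine 733 from the sample rows shown earlier (observed quality: 7)
wine_733 <- c("(Intercept)" = 1, alcohol = 11.1,
              volatile.acidity = 0.280, fixed.acidity = 8.8,
              residual.sugar = 6.0, free.sulfur.dioxide = 14,
              sulphates = 0.33)

# Linear prediction: intercept plus the sum of coefficient * value
predicted <- sum(coef_m7 * wine_733[names(coef_m7)])
print(round(predicted, 2))
```

The prediction is about 5.8 for a wine rated 7, which illustrates how strongly the model pulls predictions toward the middle of the quality scale.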
Wine quality is grouped into three categories: low, medium and high. A quality below 6 is categorized as low, a quality of 6 as medium and a quality above 6 as high.
Alcohol content, together with volatile acidity, fixed acidity, residual sugar, free sulfur dioxide, pH and sulphates, seems to have an effect on wine quality.
High alcohol combined with lower fixed acidity, volatile acidity and residual sugar is associated with high wine quality. Likewise, high alcohol content combined with high free sulfur dioxide, pH and sulphates is associated with high wine quality.
I was expecting alcohol together with chlorides, and alcohol together with free sulfur dioxide, to show some relationship with wine quality, but I was surprised to find none.
Yes. I created a series of linear models (m1 to m7) that add predictors to alcohol one at a time: volatile acidity, fixed acidity, residual sugar and free sulfur dioxide, followed by either pH (m6) or sulphates (m7).
The models with the most predictors account for about 30% of the variance in wine quality (R-squared of 0.3 in the table above). Density is not included because of its high variance inflation factor. The variance explained is not very high, so predictions of wine quality based on this model may not be reliable.
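Since the regression table rounds R-squared to one decimal, a finer value can be recovered from the F statistics it prints, using R² = F·k / (F·k + residual df), where k is the number of predictors. The sketch below assumes every model was fitted on all 4898 rows with an intercept, as the table indicates; the data frame `models` is built here for illustration.

```r
n <- 4898

# Number of predictors and F statistic of each model, from the table above
models <- data.frame(
  model  = paste0("m", 1:7),
  k      = c(1, 2, 3, 4, 5, 6, 6),
  f_stat = c(1146.4, 773.9, 527.7, 437.6, 358.3, 300.0, 302.8)
)

# Residual degrees of freedom: n minus predictors minus intercept
models$df_resid <- n - models$k - 1

# R-squared recovered from the overall F statistic
models$r2 <- with(models, f_stat * k / (f_stat * k + df_resid))
print(round(models$r2, 3))
```

On this calculation the largest models explain roughly 27 percent of the variance, which the table displays as 0.3, and most of the gain over the alcohol-only model comes from adding volatile acidity and residual sugar.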
Most wines in this data set have a quality of 5, 6 or 7. No wine has a quality below 3 or a quality of 10.
The scatter plot of alcohol against quality shows that wines of quality 5, 6 and 7 span alcohol contents from 8 percent to 13 percent. This is because these quality scores have more wine samples than the others.
Lower quality wines, scoring 5 and below, mostly have alcohol content between 8 and 11 percent. Wines of quality 6 and above have a higher median alcohol value, and the median increases with quality from 6 upward.
For the above plots we only consider alcohol values from 8 to 14 and residual sugar from 0 to 20, as most wine samples fall in this range. Wines of quality 5, 6 and 7 dominate the plot.
Low and medium quality wines mostly have residual sugar from 2 to 20 and alcohol content below 11, whereas high quality wines mostly have residual sugar from 2 to 13 and alcohol content above 11. This is consistent with our correlation analysis.
There are 4898 wine samples in the data set. I started by exploring the data set one variable at a time. Later I formulated some questions and explored some interesting features in the data set. Finally I explored the relationship between wine quality and the other chemical features.
Wine quality is positively correlated with alcohol, free sulfur dioxide, pH and sulphates. On the other hand, it is negatively correlated with density, chlorides, fixed acidity, volatile acidity and residual sugar. Further analysis suggested that free sulfur dioxide and chlorides have relatively little influence on wine quality. I then created linear models using alcohol, volatile acidity, fixed acidity, residual sugar and free sulfur dioxide, together with pH or sulphates. The models account for about 30% of the variance in wine quality. Density is not included because of its high variance inflation factor. The variance explained is not very high, so predictions of wine quality based on these features may not be reliable.
The main drawback of the data set is that the wine count for some quality scores is quite low. There is no wine with a quality below 3 or a quality of 10. Furthermore, quality scores 3, 4, 8 and 9 have only 20, 163, 175 and 5 wine samples respectively, whereas quality scores 5, 6 and 7 account for 1457, 2198 and 880 wines respectively. A data set with wine counts spread more evenly across the quality scores would make the analysis of wine quality more reliable and a predictive model more accurate.
Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot. Similarly message = FALSE parameter was added to